Discrete
Bernoulli distribution
$$f_X(x) = P(X=x) = \begin{cases} (1-p)^{1-x}\, p^{x} & \text{for } x = 0 \text{ or } 1 \\ 0 & \text{otherwise} \end{cases}$$
expectation
$$E(X) = p$$
variance
$$\operatorname{var}(X) = p(1-p)$$
Binomial distribution
$$f_X(k) = P(X=k) = \begin{cases} C_{n}^{k}\, p^{k}(1-p)^{n-k} & \text{for } k = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$
expectation
$$E(X) = np$$
variance
$$\operatorname{var}(X) = np(1-p)$$
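A quick numerical sanity check of these pmf, mean, and variance formulas (a minimal sketch, assuming scipy is available; the values of $n$ and $p$ below are arbitrary):

```python
# Sanity-check the binomial pmf, mean, and variance with scipy.stats.
# n and p are arbitrary example values, not from the notes.
from scipy.stats import binom

n, p = 10, 0.3
rv = binom(n, p)

total = sum(rv.pmf(k) for k in range(n + 1))  # pmf sums to 1 over k = 0..n
print(total)                                  # ~1.0
print(rv.mean(), n * p)                       # E(X) = np
print(rv.var(), n * p * (1 - p))              # var(X) = np(1-p)
```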
Geometric distribution
$$f_X(k) = P(X=k) = \begin{cases} p(1-p)^{k-1} & \text{for } k = 1, 2, 3, \ldots \\ 0 & \text{otherwise} \end{cases}$$
expectation
$$E(X) = \frac{1}{p}$$
variance
$$\operatorname{var}(X) = \frac{1-p}{p^{2}}$$
memoryless
$$P(X > m+n \mid X > m) = P(X > n)$$
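A small Monte Carlo sketch of the memoryless property (assuming numpy is available; $p$, $m$, $n$ below are arbitrary example values; numpy's geometric sampler counts trials up to and including the first success, matching the pmf above):

```python
# Monte Carlo check of P(X > m+n | X > m) = P(X > n) for a geometric r.v.
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 0.2, 3, 5                      # arbitrary example values
x = rng.geometric(p, size=1_000_000)     # support {1, 2, 3, ...}

lhs = np.mean(x[x > m] > m + n)          # estimate of P(X > m+n | X > m)
rhs = np.mean(x > n)                     # estimate of P(X > n)
print(lhs, rhs, (1 - p) ** n)            # both ≈ (1-p)^n
```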
Negative binomial distribution (Pascal)
The negative binomial distribution arises as a generalization of the geometric distribution.
Suppose that a sequence of independent trials, each with probability of success $p$, is performed until there are $r$ successes in all.
The last trial must be a success, and the preceding $k-1$ trials must contain exactly $r-1$ successes, so the probability can be written as $p \cdot C_{k-1}^{r-1}\, p^{r-1}(1-p)^{(k-1)-(r-1)}$.
This is denoted $X \sim NB(r, p)$.
$$f_X(k) = P(X=k) = \begin{cases} C_{k-1}^{r-1}\, p^{r}(1-p)^{k-r} & \text{for } k = r, r+1, r+2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
expectation
$$E(X) = \frac{r}{p}$$
variance
$$\operatorname{var}(X) = \frac{r(1-p)}{p^{2}}$$
The derivation can be seen there.
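Because $X$ counts trials until the $r$-th success, it can also be viewed as a sum of $r$ independent geometric waiting times; a minimal simulation sketch (numpy assumed; $r$ and $p$ are arbitrary example values):

```python
# Simulate NB(r, p) as the sum of r independent geometric waiting times
# and compare the sample mean/variance with r/p and r(1-p)/p^2.
import numpy as np

rng = np.random.default_rng(0)
r, p, size = 4, 0.3, 1_000_000           # arbitrary example values

# each row: r geometric waiting times; their sum is the trial of the r-th success
x = rng.geometric(p, size=(size, r)).sum(axis=1)

print(x.mean(), r / p)                   # E(X) = r/p
print(x.var(), r * (1 - p) / p ** 2)     # var(X) = r(1-p)/p^2
```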
Hypergeometric distribution
Suppose that an urn contains $n$ balls, of which $r$ are black and $n-r$ are white. Let $X$ denote the number of black balls drawn when taking $m$ balls without replacement.
This is denoted $X \sim h(m, n, r)$.
pmf
$$f_X(k) = P(X=k) = \begin{cases} \dfrac{C_{r}^{k}\, C_{n-r}^{m-k}}{C_{n}^{m}} & \text{for } \max(0, m-n+r) \le k \le \min(m, r) \\ 0 & \text{otherwise} \end{cases}$$
expectation
$$E(X) = m\,\frac{r}{n}$$
variance
$$\operatorname{var}(X) = \frac{mr(n-m)(n-r)}{n^{2}(n-1)}$$
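A check of these formulas with scipy (a sketch, assuming scipy is available; note scipy's hypergeom takes $(M, n, N)$ = (total balls, black balls, draws), corresponding to $(n, r, m)$ in the notes; the values below are arbitrary):

```python
# Check the hypergeometric mean/variance formulas with scipy.stats.
# scipy.stats.hypergeom(M, n, N): M = total balls, n = black balls, N = draws,
# i.e. (n, r, m) in the notation of these notes.
from scipy.stats import hypergeom

n_total, r_black, m_draws = 20, 7, 5     # arbitrary example values
rv = hypergeom(M=n_total, n=r_black, N=m_draws)

print(rv.mean(), m_draws * r_black / n_total)
print(rv.var(),
      m_draws * r_black * (n_total - m_draws) * (n_total - r_black)
      / (n_total ** 2 * (n_total - 1)))
```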
Poisson distribution
The Poisson distribution can be derived as the limit of a binomial distribution as the number of trials approaches infinity and the probability of success on each trial approaches zero in such a way that $np = \lambda$; here $\lambda$ can be interpreted as the expected number of successes.
pmf
$$P(X = k) = \frac{\lambda^{k}}{k!}\, e^{-\lambda}, \quad k = 0, 1, 2, \ldots$$
expectation
$$E(X) = \lambda$$
variance
$$\operatorname{var}(X) = \lambda$$
Property
Let $X$ and $Y$ be independent Poisson r.v.s with parameters $\theta_1$ and $\theta_2$; then $X + Y \sim \mathrm{Poisson}(\theta_1 + \theta_2)$.
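A minimal numerical sketch of the binomial-to-Poisson limit described above (scipy assumed; $\lambda$ and the values of $n$ are arbitrary):

```python
# Numerically illustrate Binomial(n, lam/n) -> Poisson(lam) as n grows.
import numpy as np
from scipy.stats import binom, poisson

lam = 3.0                                   # arbitrary rate
ks = np.arange(0, 11)
for n in (10, 100, 10_000):
    gap = np.max(np.abs(binom.pmf(ks, n, lam / n) - poisson.pmf(ks, lam)))
    print(n, gap)                           # the gap shrinks as n increases
```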
Continuous
Uniform distribution
A uniform r.v. on the interval $[a, b]$ is a model for what we mean when we say "choose a number at random between $a$ and $b$".
pdf
$$f_X(x) = \begin{cases} \frac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$
$$F_X(x) = \begin{cases} 0 & x \le a \\ \frac{x-a}{b-a} & a \le x \le b \\ 1 & b \le x \end{cases}$$
expectation
$$E(X) = \frac{a+b}{2}$$
variance
$$\operatorname{var}(X) = \frac{(b-a)^{2}}{12}$$
Exponential distribution
The exponential distribution is often used to model lifetimes or waiting times, in which context it is conventional to replace $x$ by $t$.
pdf
$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
$$F_X(x) = \begin{cases} 1 - e^{-\lambda x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
expectation
$$E(X) = \frac{1}{\lambda}$$
variance
$$\operatorname{var}(X) = \frac{1}{\lambda^{2}}$$
Memoryless
$$P(X > s+t \mid X > s) = P(X > t)$$
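The property follows from $P(X>s+t)/P(X>s) = e^{-\lambda(s+t)}/e^{-\lambda s} = e^{-\lambda t}$; a quick numerical sketch using scipy's survival function (scipy assumed; $\lambda$, $s$, $t$ are arbitrary example values):

```python
# Verify P(X > s+t | X > s) = P(X > t) for an exponential r.v. using the
# survival function. scipy parameterizes expon by scale = 1/lambda.
from scipy.stats import expon

lam, s, t = 0.5, 2.0, 3.0                 # arbitrary example values
rv = expon(scale=1 / lam)

cond = rv.sf(s + t) / rv.sf(s)            # P(X > s+t) / P(X > s)
print(cond, rv.sf(t))                     # both equal e^{-lam*t}
```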
Gamma distribution
$$g(t) = \begin{cases} \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, t^{\alpha-1} e^{-\lambda t} & t \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
$$\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u}\, du, \quad x > 0$$
expectation
$$E(X) = \frac{\alpha}{\lambda}$$
variance
$$\operatorname{Var}(X) = \frac{\alpha}{\lambda^{2}}$$
Property
$Ga(1, \lambda) = \mathrm{Exp}(\lambda)$
$Ga(\frac{n}{2}, \frac{1}{2}) = \chi^{2}(n)$, for which $E(X) = n$ and $\operatorname{Var}(X) = 2n$
If $X \sim Ga(\alpha, \lambda)$, then $kX \sim Ga(\alpha, \frac{\lambda}{k})$ for $k > 0$
If $X \sim Ga(\alpha, \lambda)$ and $Y \sim Ga(\beta, \lambda)$ are independent, then $X + Y \sim Ga(\alpha + \beta, \lambda)$
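The first two identities can be checked numerically; a minimal sketch assuming scipy is available (scipy's gamma uses shape $a = \alpha$ and scale $= 1/\lambda$; $\lambda$ and $n$ below are arbitrary):

```python
# Check Ga(1, lam) = Exp(lam) and Ga(n/2, 1/2) = chi^2(n) by comparing pdfs.
# scipy's gamma(a, scale) corresponds to Ga(alpha = a, lambda = 1/scale).
import numpy as np
from scipy.stats import gamma, expon, chi2

x = np.linspace(0.01, 20, 200)
lam, n = 0.7, 5                            # arbitrary example values

print(np.allclose(gamma(a=1, scale=1 / lam).pdf(x), expon(scale=1 / lam).pdf(x)))
print(np.allclose(gamma(a=n / 2, scale=2).pdf(x), chi2(n).pdf(x)))
```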
Derivation
$$\because \Gamma(\alpha) = \int_{0}^{\infty} x^{\alpha-1} e^{-x}\, dx$$
Substituting $x = \lambda t$:
$$\Gamma(\alpha) = \lambda^{\alpha} \int_{0}^{\infty} t^{\alpha-1} e^{-\lambda t}\, dt$$
$$\therefore \frac{\lambda^{\alpha}}{\Gamma(\alpha)} \int_{0}^{\infty} t^{\alpha-1} e^{-\lambda t}\, dt = 1$$
$$\therefore g(t) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, t^{\alpha-1} e^{-\lambda t}, \quad t \ge 0$$
$\alpha$ is called a shape parameter for the gamma density; varying $\alpha$ changes the shape of the density.
$\lambda$ is called a scale parameter; varying $\lambda$ corresponds to changing the units of measurement and does not affect the shape of the density.
how to understand gamma?
Normal distribution
pdf
$$f_X(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-(x-\mu)^{2} / (2\sigma^{2})}, \quad -\infty < x < \infty$$
$\mu$ is the mean and $\sigma$ is the standard deviation.
If $X \sim N(\mu, \sigma^{2})$ and $Y = aX + b$, then $Y \sim N(a\mu + b, a^{2}\sigma^{2})$.
In particular, if $X \sim N(\mu, \sigma^{2})$, then $Z = \frac{X-\mu}{\sigma} \sim N(0, 1)$.
For jointly normal $X$ and $Y$: $aX + bY \sim N(a\mu_X + b\mu_Y,\ a^{2}\sigma_X^{2} + b^{2}\sigma_Y^{2} + 2ab\rho\,\sigma_X \sigma_Y)$.
property
If $X, Y \sim N(0, 1)$ are independent, then $U = \frac{X}{Y}$ is a Cauchy r.v. (lec3):
$$f_U(u) = \frac{1}{\pi(u^{2}+1)}$$
If $X_1, \ldots, X_n \sim N(0, 1)$, i.i.d., then
$$X_1^{2} + \cdots + X_n^{2} \sim \chi^{2}(n)$$
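A small simulation sketch of the last fact (numpy assumed; $n$ and the number of replications are arbitrary): the sum of squares should have mean $n$ and variance $2n$, matching $\chi^{2}(n)$:

```python
# Monte Carlo check that X_1^2 + ... + X_n^2 ~ chi^2(n) for i.i.d. N(0,1) X_i.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 6, 500_000                       # arbitrary example values
s = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)

print(s.mean(), n)                         # chi^2(n) has mean n
print(s.var(), 2 * n)                      # ... and variance 2n
```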
Logistic distribution
Consider the standard logistic distribution $(0, 1)$:
$$F_X(x) = \frac{1}{1+e^{-x}}$$
Exponential family
A family of pdfs or pmfs is called an exponential family if it can be expressed as
$$p(x, \theta) = H(x)\, \exp\!\left(\theta^{T} \phi(x) - A(\theta)\right)$$
Here $H(x)$ is the base measure, $\phi(x)$ the sufficient statistic, $\theta$ the natural parameter, and $A(\theta)$ the log-partition function that normalizes the density.
This form is very helpful for modeling heterogeneous data in the era of big data.
Bernoulli, Gaussian, Binomial, Poisson, Exponential, Weibull, Laplace, Gamma, Beta, Multinomial, Wishart distributions are all exponential families
For the Bernoulli distribution:
$$X \sim p^{x}(1-p)^{1-x}, \quad \text{for } x \in \{0, 1\}$$
$$p^{x}(1-p)^{1-x} = \exp\{x \ln p + (1-x)\ln(1-p)\} = \exp\left\{x \ln\frac{p}{1-p} + \ln(1-p)\right\}$$
so $\theta = \ln\frac{p}{1-p}$, $\phi(x) = x$, $A(\theta) = \ln\frac{1}{1-p} = \ln(1 + e^{\theta})$, and $H(x) = 1$.
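A minimal sketch that rebuilds the Bernoulli pmf from these exponential-family pieces (numpy assumed; the value of $p$ is arbitrary):

```python
# Reconstruct the Bernoulli pmf from H(x) * exp(theta * phi(x) - A(theta)).
import numpy as np

p = 0.3                                    # arbitrary example value
theta = np.log(p / (1 - p))                # natural parameter
A = np.log(1 + np.exp(theta))              # log-partition, equals -ln(1-p)

for x in (0, 1):
    ef_form = 1.0 * np.exp(theta * x - A)  # H(x) = 1, phi(x) = x
    direct = p ** x * (1 - p) ** (1 - x)
    print(x, ef_form, direct)              # the two forms agree
```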
The explanation can be seen here.
Sample
$$\operatorname{Var}(\bar{X}) = \frac{\sigma^{2}}{n}$$
$$(n-1)S^{2} = \sum_{i=1}^{n} X_i^{2} - n\bar{X}^{2}$$
For an i.i.d. sample from $N(\mu, \sigma^{2})$:
$\bar{X}$ and $S^{2}$ are independent
$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^{2}}{n}\right)$$
$$\frac{(n-1)S^{2}}{\sigma^{2}} \sim \chi^{2}(n-1)$$
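A simulation sketch of these sampling-distribution facts (numpy assumed; $\mu$, $\sigma$, $n$ and the number of replications are arbitrary):

```python
# Check Var(X_bar) = sigma^2/n and (n-1)S^2/sigma^2 ~ chi^2(n-1) by simulation.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.0, 2.0, 8, 200_000   # arbitrary example values
samples = rng.normal(mu, sigma, size=(reps, n))

xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)            # unbiased sample variance S^2
q = (n - 1) * s2 / sigma ** 2

print(xbar.var(), sigma ** 2 / n)           # Var(X_bar) = sigma^2 / n
print(q.mean(), n - 1)                      # chi^2(n-1) has mean n-1
print(q.var(), 2 * (n - 1))                 # ... and variance 2(n-1)
```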
Property
$$E(X) = E(E(X \mid Y))$$
$$\operatorname{Var}(X) = E(\operatorname{Var}(X \mid Y)) + \operatorname{Var}(E(X \mid Y))$$
If r.v.s $X$ and $Y$ are independent, then $E(X \mid Y) = E(X)$.
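A simulation sketch of the tower property and the law of total variance; the two-stage model $N \sim \mathrm{Poisson}(\lambda)$, $X \mid N \sim \mathrm{Binomial}(N, p)$ is only an illustrative choice (numpy assumed; $\lambda$ and $p$ arbitrary):

```python
# Check E(X) = E(E(X|N)) and Var(X) = E(Var(X|N)) + Var(E(X|N))
# for N ~ Poisson(lam), X | N ~ Binomial(N, p).
import numpy as np

rng = np.random.default_rng(0)
lam, p, reps = 4.0, 0.3, 1_000_000          # arbitrary example values

N = rng.poisson(lam, size=reps)
X = rng.binomial(N, p)                      # one X per simulated N

cond_mean = N * p                           # E(X | N)
cond_var = N * p * (1 - p)                  # Var(X | N)

print(X.mean(), cond_mean.mean())                    # tower property
print(X.var(), cond_var.mean() + cond_mean.var())    # law of total variance
```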
Inequality
Markov's inequality
For a nonnegative r.v. $X$ and $a > 0$:
$$P(X \ge a) \le \frac{E(X)}{a}$$
Chebyshev's inequality
$$P(|X - E(X)| \ge a) \le \frac{\operatorname{Var}(X)}{a^{2}}$$
Chernoff bounds
The generic Chernoff bound requires only the moment generating function of $X$, defined as $M_X(t) = E(e^{tX})$, provided it exists. For every $t > 0$,
$$P(X \ge a) \le \frac{E(e^{tX})}{e^{ta}}$$
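A worked instance for $X \sim N(0,1)$: $M_X(t) = e^{t^{2}/2}$, so minimizing $e^{-ta}M_X(t)$ over $t > 0$ gives $t = a$ and the bound $e^{-a^{2}/2}$; a quick comparison with the exact tail (scipy assumed; the thresholds are arbitrary):

```python
# Chernoff bound for a standard normal: P(X >= a) <= exp(-a^2 / 2),
# obtained by minimizing exp(-t*a) * M_X(t) = exp(t^2/2 - t*a) at t = a.
import numpy as np
from scipy.stats import norm

for a in (1.0, 2.0, 3.0):                  # arbitrary example thresholds
    bound = np.exp(-a ** 2 / 2)
    exact = norm.sf(a)                     # actual tail probability
    print(a, exact, bound, exact <= bound) # the bound always holds
```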
Other inequalities can be seen here.